AITopics | ffn layer

UMoE: Unifying Attention and FFN with Shared Experts

Neural Information Processing SystemsJun-16-2026, 05:16:09 GMT

Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, that reveals an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States (0.46)
Asia > Japan (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

b8f10193cab43d45df9bb810637333fd-Paper-Conference.pdf

Neural Information Processing SystemsApr-30-2026, 01:32:59 GMT

large language model, machine learning, sparsity, (20 more...)

Neural Information Processing Systems

Country:

North America > United States (1.00)
Europe (0.93)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Communications (0.98)
(2 more...)

Add feedback

11fc8c98b46d4cbdfe8157267228f7d7-Supplemental-Conference.pdf

Neural Information Processing SystemsApr-24-2026, 16:10:40 GMT

artificial intelligence, conditional moe, machine learning, (16 more...)

Neural Information Processing Systems

Industry: Information Technology (0.32)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

b8f10193cab43d45df9bb810637333fd-Paper-Conference.pdf

Neural Information Processing SystemsFeb-17-2026, 17:47:42 GMT

large language model, machine learning, sparsity, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Michigan (0.04)
(5 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Communications (0.98)
(2 more...)

Add feedback

How does Architecture Influence the Base Capabilities

Neural Information Processing SystemsFeb-17-2026, 01:02:14 GMT

Unlike existing work focusing on the influence of scale on base capabilities, our work examines the influence of architecture on those. Specifically, our concern is: How does architecture influence the base capabilities of pre-trained language models?

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > China > Hong Kong (0.04)
(11 more...)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)

Add feedback

2d8f2351de4e9248d91ffa52dae2e6a2-Paper-Conference.pdf

Neural Information Processing SystemsFeb-10-2026, 09:59:27 GMT

attn, ffn, transformer, (15 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

Add feedback

A Transformer-based Neural Architecture Search Method

Wang, Shang, Tang, Huanrong, Ouyang, Jianquan

arXiv.org Artificial IntelligenceNov-3-2025

This paper presents a neural architecture search method based on Transformer architecture, searching cross multihead attention computation ways for different number of encoder and decoder combinations. In order to search for neural network structures with better translation results, we considered perplexity as an auxiliary evaluation metric for the algorithm in addition to BLEU scores and iteratively improved each individual neural network within the population by a multi-objective genetic algorithm. Experimental results show that the neural network structures searched by the algorithm outperform all the baseline models, and that the introduction of the auxiliary evaluation metric can find better models than considering only the BLEU score as an evaluation metric.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2505.01314

Country:

Europe > Portugal (0.17)
Asia > China (0.16)
North America > United States (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)

Add feedback

UMoE: Unifying Attention and FFN with Shared Experts

Yang, Yuanhang, Wang, Chaozheng, Li, Jing

arXiv.org Artificial IntelligenceOct-24-2025

Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, that reveals an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2505.0726

Country:

Asia (1.00)
North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

How does Architecture Influence the Base Capabilities

Neural Information Processing SystemsOct-10-2025, 11:31:13 GMT

Unlike existing work focusing on the influence of scale on base capabilities, our work examines the influence of architecture on those. Specifically, our concern is: How does architecture influence the base capabilities of pre-trained language models?

base capability, contribution ratio, experiment, (12 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > China > Hong Kong (0.04)
(11 more...)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)

Add feedback

ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall

Yang, Jiayu, Fan, Yuxuan, Lai, Songning, Wu, Shengen, Tang, Jiaqi, Kang, Chun, Guo, Zhijiang, Yue, Yutao

arXiv.org Artificial IntelligenceOct-10-2025

Large Language Models (LLMs) require efficient knowledge editing (KE) to update factual information, yet existing methods exhibit significant performance decay in multi-hop factual recall. This failure is particularly acute when edits involve intermediate implicit subjects within reasoning chains. Through causal analysis, we reveal that this limitation stems from an oversight of how chained knowledge is dynamically represented and utilized at the neuron level. We discover that during multi hop reasoning, implicit subjects function as query neurons, which sequentially activate corresponding value neurons across transformer layers to accumulate information toward the final answer, a dynamic prior KE work has overlooked. Guided by this insight, we propose ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall, a framework that leverages neuron-level attribution to identify and edit these critical query-value (Q-V) pathways. ACE provides a mechanistically grounded solution for multi-hop KE, empirically outperforming state-of-the-art methods by 9.44% on GPT-J and 37.46% on Qwen3-8B. Our analysis further reveals more fine-grained activation patterns in Qwen3 and demonstrates that the semantic interpretability of value neurons is orchestrated by query-driven accumulation. These findings establish a new pathway for advancing KE capabilities based on the principled understanding of internal reasoning mechanisms.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.07896

Country: